
Here I want to discuss the main probability distributions (based on my humble knowledge). Probability fascinates me because it has applications across so many areas of science. The principal distributions needed to understand inference and statistical modelling are the Bernoulli, Binomial, Negative-Binomial, Poisson, Normal, and Gamma. Of course, there are many other essential distributions that I will not cover here. For each one, I will try to explain the support and the parameters, as well as the idea behind it.

Bernoulli distribution

The first, and the most famous, is the Bernoulli distribution. Let X be a binary random variable with probability mass function (PMF) f_{X}. Then X \sim Ber(p) has PMF

f(x) = \mathrm{P}(X = x) = p^{x} (1-p)^{1-x}

where the support is x \in \{0, 1\} and the parameter space is p \in (0, 1). The expected value (mathematical expectation) is \mathrm{E}(X) = p and the variance is \mathrm{Var}(X) = p(1 - p).
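As a quick sanity check on the formulas above, here is a minimal sketch in R. The helper `ber_pmf` and the value `p = 0.3` are my own illustrative choices, not part of any package; the idea is just to confirm that the two probabilities sum to one and that the mean and variance match p and p(1 - p).

```r
# Hypothetical helper (illustration only): Bernoulli PMF written from the formula.
ber_pmf <- function(x, p) p^x * (1 - p)^(1 - x)

p <- 0.3          # arbitrary illustrative value
x <- c(0, 1)      # support of X
pmf <- ber_pmf(x, p)

# The probabilities over the support sum to one.
sum(pmf)

# The formula agrees with R's built-in dbinom() when size = 1.
all.equal(pmf, dbinom(x, size = 1, prob = p))

# Exact mean and variance computed directly from the PMF.
mean_exact <- sum(x * pmf)                   # should equal p
var_exact  <- sum((x - mean_exact)^2 * pmf)  # should equal p * (1 - p)

# Monte Carlo check: sample moments should be close to the exact ones.
set.seed(1)
draws <- rbinom(1e5, size = 1, prob = p)
c(mean = mean(draws), var = var(draws))
```

In base R a Bernoulli variable is simply a Binomial with `size = 1`, which is why `dbinom()` and `rbinom()` are used here.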

I will not discuss moments here; briefly, the first moment is the mathematical expectation and the second is related to the variance. Wikipedia is a good place to start if you want to study the topic further. I love this concept because almost everything in statistical modelling is linked to a mean, especially in generalized linear models (GLMs). But that is a topic for a later post.

Below is code that draws fifteen values of p at random and plots the corresponding Bernoulli probabilities. In the chart, the x-axis shows the two outcomes, X = 1 and X = 0, and the y-axis shows \mathrm{P}(X = 1) = p and \mathrm{P}(X = 0) = 1 - p, respectively, with one bar per value of p. Another property to keep in mind is \mathrm{P}(X = 1) + \mathrm{P}(X = 0) = 1.

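set.seed(123)  # make the random draws reproducible

# Grid of candidate success probabilities strictly inside (0, 1)
value <- seq(1e-06, 0.999999, by = 0.001)

# Fifteen values of p = P(X = 1), and the complements q = P(X = 0)
p <- sample(value, size = 15, replace = TRUE)
q <- 1 - p

# One column per outcome, one row per sampled p
data0 <- cbind(`X = 1` = p, `X = 0` = q)

# Grouped bars: within each outcome, one bar per value of p
barplot(data0, beside = TRUE, main = "Bernoulli distribution",
    xlab = "Realization of the variable",
    ylab = "Probability", col = rainbow(15))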

The Bernoulli distribution gets a lot of attention in many areas, mainly because of its applications. Whenever the response variable has only two possible outcomes (good/bad, yes/no), logistic regression is the natural model to consider.
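To make the link with logistic regression concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the covariate `x`, the "true" coefficients `beta0` and `beta1`, and the sample size are made up, and the fit uses base R's `glm()` with the binomial family and logit link.

```r
set.seed(42)

n_obs <- 500
x <- rnorm(n_obs)               # hypothetical covariate

# Hypothetical true coefficients on the logit scale
beta0 <- -0.5
beta1 <- 1.2

# P(Y = 1 | x) through the inverse logit, then Bernoulli responses
prob <- plogis(beta0 + beta1 * x)
y <- rbinom(n_obs, size = 1, prob = prob)

# Logistic regression: Bernoulli response with a logit link
fit <- glm(y ~ x, family = binomial(link = "logit"))

# Estimated coefficients, to compare with beta0 and beta1
coef(fit)
```

The estimates will not match the simulated coefficients exactly, but with a few hundred observations they should land reasonably close to them.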

Binomial distribution

The Binomial distribution is essential primarily because of its applications in experimental work in health, agronomy, and other sciences. Let X be a discrete random variable counting the number of successes in n independent Bernoulli trials, with probability mass function (PMF) f_{X}. Then X \sim Bin(n, p) has PMF

f(x) = \mathrm{P}(X = x) = {{n}\choose{x}} p^{x} (1-p)^{n-x}

where the support is x \in \{0, 1, \dots, n\} (the number of successes), and the parameter space is p \in (0, 1) (the success probability in each trial) and n \in \{0, 1, \dots\} (the number of trials). The expected value is \mathrm{E}(X) = np and the variance is \mathrm{Var}(X) = np(1 - p).
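The Binomial formulas can be checked numerically in the same spirit. In this sketch the helper `bin_pmf` and the values `n = 10`, `p = 0.4` are hypothetical choices of mine; the built-in `dbinom()` serves as the reference.

```r
# Hypothetical helper (illustration only): Binomial PMF written from the formula.
bin_pmf <- function(x, n, p) choose(n, x) * p^x * (1 - p)^(n - x)

n <- 10        # arbitrary number of trials
p <- 0.4       # arbitrary success probability
x <- 0:n       # support of X

pmf <- bin_pmf(x, n, p)

# The hand-written PMF agrees with R's dbinom().
all.equal(pmf, dbinom(x, size = n, prob = p))

# The probabilities over the support sum to one.
sum(pmf)

# Exact mean and variance computed directly from the PMF.
mean_exact <- sum(x * pmf)                   # should equal n * p
var_exact  <- sum((x - mean_exact)^2 * pmf)  # should equal n * p * (1 - p)
c(mean = mean_exact, var = var_exact)
```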

As for n, I prefer to treat it as a fixed value rather than a parameter, because it is always known in advance. A parameter, by contrast, is something you estimate from the sample rather than something you know beforehand. In that sense n plays a role similar to a hyperparameter in machine learning techniques.

The graph below shows how different values of p affect the shape of the Binomial probability mass function.
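A graph of this kind can be drawn with a few lines of base R. This is only a sketch under my own assumptions: `n = 20` and the three values of p are arbitrary, and the points are joined by lines purely as a visual aid, since the distribution is discrete.

```r
n <- 20                      # arbitrary number of trials
x <- 0:n                     # support of X
probs <- c(0.2, 0.5, 0.8)    # arbitrary success probabilities to compare

# One column of probabilities per value of p: an (n + 1) x 3 matrix
pmf <- sapply(probs, function(p) dbinom(x, size = n, prob = p))
colnames(pmf) <- paste0("p = ", probs)

cols <- c("steelblue", "darkorange", "forestgreen")

# Points mark P(X = x); the connecting lines only guide the eye
matplot(x, pmf, type = "b", pch = 16, lty = 1, col = cols,
    main = "Binomial distribution, n = 20",
    xlab = "x (number of successes)", ylab = "P(X = x)")
legend("top", legend = colnames(pmf), col = cols,
    pch = 16, lty = 1, bty = "n")
```

Small p pushes the mass toward zero, large p pushes it toward n, and p = 0.5 gives a shape symmetric around n/2.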

REFERENCES

Agresti, Alan. 2015. Foundations of Linear and Generalized Linear Models. John Wiley & Sons.
DeGroot, Morris H, and Mark J Schervish. 2012. Probability and Statistics. Pearson Education.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rigby, Robert A, Mikis D Stasinopoulos, Gillian Z Heller, and Fernanda De Bastiani. 2019. Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R. CRC Press.